Some Problems of Chinese Segmentation
Abstract
In this paper, we discussed the main problems in Chinese segmentation. Firstly, machine segmentation ambiguity (MSA) was defined formally. The automatic identification of MSA and types of ambiguities was emphasized as a most important step of Chinese segmentation. Then, we summarized the existing algorithms of Chinese segmentation (including the identification of Chinese names) with theoretical comparisons. Finally, to reach the statistically best result of segmentation, we proposed dynamic machine learning of the lexicon.
Similar resources
An Enhanced Model for Chinese Word Segmentation and Part-of-Speech Tagging
This paper will present an enhanced probabilistic model for Chinese word segmentation and part-of-speech (POS) tagging. The model introduces the information of Chinese word length as one of its features to reach a more accurate result. And in addition, the model also achieves the integration of segmentation and POS tagging. After presenting the model, this paper will give a brief discussion on ...
Can Word Segmentation be Considered Harmful for Statistical Machine Translation Tasks between Japanese and Chinese?
Unlike most Western languages, there are no typographic boundaries between words in written Japanese and Chinese. Word segmentation is thus normally adopted as an initial step in most natural language processing tasks for these Asian languages. Although word segmentation techniques have improved greatly both theoretically and practically, there still remain some problems to be tackled. In this...
Identification of Chinese Personal Names in Unrestricted Texts
Automatic identification of Chinese personal names in unrestricted texts is a key task in Chinese word segmentation, and can affect other NLP tasks such as word segmentation and information retrieval, if it is not properly addressed. This paper (1) demonstrates the problems of Chinese personal name identification in some IT applications, (2) analyzes the structure of Chinese personal names, and...
Exploiting Shared Chinese Characters in Chinese Word Segmentation Optimization for Chinese-Japanese Machine Translation
Unknown words and word segmentation granularity are two main problems in Chinese word segmentation for Chinese-Japanese Machine Translation (MT). In this paper, we propose an approach of exploiting common Chinese characters shared between Chinese and Japanese in Chinese word segmentation optimization for MT aiming to solve these problems. We augment the system dictionary of a Chinese segmenter b...
A Study of Chinese Word Segmentation Based on the Characteristics of Chinese
This paper introduces the research on Chinese word segmentation (CWS). The word segmentation of Chinese expressions is difficult because there are no word boundaries in Chinese expressions and because several kinds of ambiguity can result in different segmentations. To distinguish itself from the conventional research that usually places more emphasis on the algorithms employe...
Publication date: 2001